Introduction

RNA-Seq experiment on six DC populations either on C57Black6 mice vs C57black6 mice with STAT6 knockout. DC populations are defined by CD11b, CD103, CD24 status from sorting.

Quantifying reads

There are many steps involved in analysing an RNA-Seq experiment. Analysing an RNAseq experiment begins with sequencing reads. These are aligned to a reference genome, then the number of reads mapped to each gene can be counted. This results in a table of counts, which is what we perform statistical analyses on in R. While mapping and counting are important and necessary tasks, in this report we will be starting from the count data and getting stuck into analysis.

The entirety of this dataset was downloaded from a Globus link provided by Otago Genomics who performed the RNA-Sequencing on the 5th October 2019. The unique ID for the sequencing was OG4953 with 36 samples labelled split across three sequencing lanes. The 36 samples were comprised of three replicates of 12 cell populations. Six populations were labeled as B6, the other six as KO. The six cell populations were labeled as 24p, 24n, 103p, 103n, 11bp, and TN.

Trimmomatic

I used Trimmomatic v0.36 to trim our RNA-Seq reads. I removed Illumina adapters, base pairs with < 30 phred quality score, and reads less than 20 bp in length.

https://academic.oup.com/bioinformatics/article/30/15/2114/2390096

Anthony M. Bolger, Marc Lohse, Bjoern Usadel, Trimmomatic: a flexible trimmer for Illumina sequence data, Bioinformatics, Volume 30, Issue 15, 1 August 2014, Pages 2114–2120, https://doi.org/10.1093/bioinformatics/btu170

For reproducibility, we used the publicly available Docker comics/trimmomatic which was currently running trimmomatic 0.36 using the following code

# Concatenate all RNA-Seq exeperiment fastq files
for x in 1 2 3 4 5 6 7 8 9;
do
  echo merging $x
  cat CDT11ANXX-4953-0$x-50-1_S$x_L00*_R1_001.fastq.gz > combined/4953_S0${x}_R1.fastq.gz
  cat CDT11ANXX-4953-0$x-50-1_S$x_L00*_R2_001.fastq.gz > combined/4953_S0${x}_R2.fastq.gz
done

for x in 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36;
do
  echo merging $x;
  cat CDT11ANXX-4953-$x-50-1_S$x_L00*_R1_001.fastq.gz > combined/4953_S${x}_R1.fastq.gz
  cat CDT11ANXX-4953-$x-50-1_S$x_L00*_R2_001.fastq.gz > combined/4953_S${x}_R2.fastq.gz
done

# Trimmomatic 0.36 docker and code
docker run -v /stornext/BioinfoSAN/Bioinformatics/Illumina/OG4953/OG4953_fastq/combined:/data/ -it comics/trimmomatic:0.36 bash
ADAPTfile=/software/applications/Trimmomatic/0.36/adapters/TruSeq3-PE-2.fa;
for x in 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36;
echo trimming sample $x;
java -jar $TRIMMOMATIC PE -threads 12 4953_S${x}_R1.fastq.gz 4953_S${x}_R2.fastq.gz -baseout trimmed/4953_S${x}.fq.gz ILLUMINACLIP:${ADAPTfile}:2:15:8:5:true  LEADING:3  TRAILING:3  SLIDINGWINDOW:10:20  MINLEN:20;
done

STAR

We used STAR 2.7.1a implemented in an in-house Docker container to map our reads to the M16 mm10 mouse genome.

https://github.com/alexdobin/STAR

https://academic.oup.com/bioinformatics/article/29/1/15/272537

STAR: ultrafast universal RNA-seq aligner. Alexander Dobin, Carrie A. Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, Thomas R. Gingeras. Bioinformatics, Volume 29, Issue 1, January 2013, Pages 15–21, https://doi.org/10.1093/bioinformatics/bts635

We aligned to Gencode mouse primary assembly M16 https://www.gencodegenes.org/mouse/release_M16.html

# STAR 2.7.1a docker and code
docker run -v /stornext/BioinfoSAN/Bioinformatics/Illumina/OG4953/OG4953_fastq/combined/trimmed:/data/ -it samold_staraligner/star2.7.1a:latest bash
STAR --runThreadN 12 -- runMode genomeGenerate --genomeDir /data/STAR_Mapped/star_genome_directory --genomeFastaFiles /data/STAR_Mapped/star_genome/GRCm38.primary_assembly.genome.fa --sjdbGTFfile /data/STAR_Mapped/star_genome/gencode.vM16.primary_assembly.annotation.gff3
for x in 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36;
do echo mapping $x;
STAR --runThreadN 12 --genomeDir /data/STAR_mapped/star_genome_directory --readFilesIn /data/4953_S${x}_1P.fq.gz /data/4953_S${x}_2P.fq.gz --readFilesCommand gunzip -c --outFileNamePrefix /data/STAR_mapped/output/${x}_STAR_9.Oct.19_
done

featureCounts

featureCounts 1.28.1 on R 3.4.4 “Someone to lean on” was used to count the mapped reads directly from the SAM file output from STAR.

https://www.ncbi.nlm.nih.gov/pubmed/24227677

> citation("Rsubread")

Please cite Rsubread as below:

  Liao Y, Smyth GK and Shi W (2013). The Subread aligner: fast,
  accurate and scalable read mapping by seed-and-vote. Nucleic Acids
  Research, 41(10):e108.

A BibTeX entry for LaTeX users is

  @Article{,
    title = {The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote},
    author = {Yang Liao and Gordon K. Smyth and Wei Shi},
    journal = {Nucleic Acids Research},
    year = {2013},
    volume = {41},
    issue = {10},
    pages = {e108},
  }

> R.Version() 
$platform
[1] "x86_64-pc-linux-gnu"

$arch
[1] "x86_64"

$os
[1] "linux-gnu"

$system
[1] "x86_64, linux-gnu"

$status
[1] ""

$major
[1] "3"

$minor
[1] "4.4"

$year
[1] "2018"

$month
[1] "03"

$day
[1] "15"

$`svn rev`
[1] "74408"

$language
[1] "R"

$version.string
[1] "R version 3.4.4 (2018-03-15)"

$nickname
[1] "Someone to Lean On"

> sessionInfo(
+ )
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.6 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.0
LAPACK: /usr/lib/lapack/liblapack.so.3.0

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Rsubread_1.28.1

loaded via a namespace (and not attached):
[1] compiler_3.4.4 tools_3.4.4   

I created a Docker to use featureCounts in a reproducible manner.

# Rsubread
docker run -v /stornext/BioinfoSAN/Bioinformatics/Illumina/OG4953/OG4953_fastq/combined/trimmed/STAR_mapped:/data/ -it rsubread/afterstar:latest R
library(Rsubread)
sampleVector = c("01","02","03","04","05","06","07","08","09","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24","25","26","27","28","29","30","31","32","33","34","35","36")
samFileNames=NULL;
for(sampleName in sampleVector){
  samFileNames = c(samFileNames, paste0("/data/output/", sampleName, "_STAR_9.Oct.19_Aligned.out.sam"))
}
FCFile = featureCounts(nthreads = 12,
                       primaryOnly = TRUE,
                       isPairedEnd = TRUE,
                       checkFragLength = TRUE,
                       files = samFileNames,
                       GTF.featureType = "gene",
                       GTF.attrType = "gene_name",
                       isGTFAnnotationFile = TRUE,
                       allowMultiOverlap=F,
                       countMultiMappingReads = T,
                       annot.ext = "/data/star_genome/gencode.vM16.primary_assembly.annotation.gff3")
write.csv(FCFile$counts, file="/data/output/RNASeq_raw_counts_STAR_FR_2019-10-09.csv")
write.csv(FCFile$stat, file="/data/output/FCFile$stat_STAR_FR_2019-10-09.csv")

I count multi-mapped reads because I align to the genome, and best practise reccomends that they are not discarded: “Genomic multireads are primarily due to repetitive sequences or shared domains of paralogous genes. They normally account for a significant fraction of the mapping output when mapped onto the genome and should not be discarded.”

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0881-8

A survey of best practices for RNA-seq data analysis Ana Conesa, Pedro Madrigal, Sonia Tarazona, David Gomez-Cabrero, Alejandra Cervera, Andrew McPherson, Michał Wojciech Szcześniak, Daniel J. Gaffney, Laura L. Elo, Xuegong Zhang & Ali Mortazavi Genome Biology Volume 17, Article number: 13 (2016)

Quality Analysis of RNA Sequencing reads & replicates

Pearson & Spearman correlations of Replicates

Pearson correlation scores between replicates when top 1% outliers and transcripts with less than 5 reads (on average) are removed.

Pearson Correlation matrices between each sample with outliers removed. Outliers are the top 1% most reads transcripts, and transcripts with less than 5 reads. Mouseoverable plot.

Sample 14 is clearly too much of an outlier to even include in any Quality analysis, so I will remove Sample 14 from all analyses from now on.

Pearson Correlation matrices between each sample with outliers removed. Outliers are the top 1% most reads transcripts, and transcripts with less than 5 reads. Mouseoverable plot.

Spearman Correlation matrices between each sample with outliers removed. Outliers are the top 1% most reads transcripts, and transcripts with less than 5 reads. Mouseoverable plot.

Conclusion: Triple replicates are all strongly correlated with both Pearson & Spearman correlations > 0.9 R\(^2\) when top 1% outliers and transcripts less than 5 reads are removed. (These cutoffs are chosen from Immgen standards if citation needed).

PCA on total RNA-Seq data for the Tfh experiment

This is the same PCA repeated 4 times to show different labels

PCA summaries

  1. We can see that most replicates cluster well on a PCA.
  2. We can see that STAT6KO status has little effect on PCA clusters.
  3. No overall trend STAT6KO vs B6.
  4. PCA shows if cells have been sampled from ear, mediastinal LN, small intestine.

Conclusion

I’ve provided in this document all code required for recalculation of results when the RNA-Seq data is made publicly available, with citations. Not a complete document, there is still analysis to be performed.